Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework
Authors
Abstract
The performance of applications on large shared-memory multiprocessors with coherent caches depends on the interaction between the granularity of data sharing, the size of the coherence unit, and the spatial locality exhibited by the applications, in addition to the amount of parallelism in the applications. Large coherence units are helpful in exploiting spatial locality, but worsen the effects of false sharing. A mathematical framework that allows a clean description of the relationship between spatial locality and false sharing is derived in this paper. First, a technique to identify a severe form of multiple-writer false sharing is presented. The importance of the interaction between optimization techniques aimed at enhancing locality and the techniques oriented toward reducing false sharing is then demonstrated. Given the conflicting requirements, a compiler-based approach to this problem holds promise. This paper investigates the use of data transformations in addressing spatial locality and false sharing, and derives an approach that balances the impact of the two. Experimental results demonstrate that such a balanced approach outperforms those approaches that consider only one of these two issues. On an eight-processor SGI/Cray Origin 2000 multiprocessor, our approach brings an additional 9 percent improvement over a powerful locality optimization technique that uses both loop and data transformations. Also, the presented approach obtains an additional 19 percent improvement over an optimization technique that is oriented specifically toward reducing false sharing. This study also reveals that, in addition to reducing synchronization costs and improving the memory subsystem performance, obtaining large granularity parallelism is helpful in balancing the effects of enhancing locality and reducing false sharing, rendering them compatible.
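The tension between coherence-unit size and false sharing can be illustrated with a small counting sketch. It is not the paper's mathematical framework or its compiler technique. It is only a toy model under stated assumptions: a one-dimensional array whose elements each have exactly one writer, so any cache line with more than one writer is falsely shared. The function names, the array size, and the two distributions (cyclic and block) are hypothetical choices made for illustration.

```python
from collections import defaultdict


def multi_writer_lines(n_elems, elem_bytes, line_bytes, owner):
    """Count cache lines (coherence units) written by more than one processor.

    owner(i) returns the processor that writes element i. Each element has a
    single writer in this toy model, so a line with several writers is a case
    of multiple-writer false sharing.
    """
    writers = defaultdict(set)
    for i in range(n_elems):
        line = (i * elem_bytes) // line_bytes  # index of the line holding element i
        writers[line].add(owner(i))
    return sum(1 for procs in writers.values() if len(procs) > 1)


def cyclic_owner(n_procs):
    """Hypothetical cyclic distribution: element i is written by processor i mod P."""
    return lambda i: i % n_procs


def block_owner(n_elems, n_procs):
    """Hypothetical block distribution: each processor writes one contiguous chunk."""
    chunk = -(-n_elems // n_procs)  # ceiling division
    return lambda i: i // chunk


if __name__ == "__main__":
    n_elems, n_procs, elem_bytes = 1024, 8, 8  # assumed sizes, not from the paper
    for line_bytes in (8, 32, 128):
        cyc = multi_writer_lines(n_elems, elem_bytes, line_bytes, cyclic_owner(n_procs))
        blk = multi_writer_lines(n_elems, elem_bytes, line_bytes,
                                 block_owner(n_elems, n_procs))
        print(f"line={line_bytes:4d}B  cyclic={cyc:4d}  block={blk:4d}")
```

In this model, a line that holds a single element is never falsely shared. Once a line holds several elements, the cyclic distribution makes every line multi-writer, while the block distribution keeps each processor's data contiguous and, with the aligned sizes assumed here, avoids multi-writer lines entirely. This is the kind of layout choice a data transformation can influence.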
Similar resources
Odin: Design and Evaluation of a Single Address Space Multiprocessor
Odin is a new high performance single address space multiprocessor design. The contribution of this investigation is the synthesis of three important new methods into a unified system which maximises data locality and significantly reduces data access latencies. To achieve high performance Odin uses a segmented stack to maintain data locality after thread migration, and a memory mapping that di...
Unified Locality-Sensitive Signatures for Transactional Memory
Transactional Memory (TM) systems must record the memory locations read and written by concurrent transactions in order to detect conflicts. Some TM implementations use signatures for this purpose, which summarize read and write sets in bounded hardware at the cost of false positives due to address aliasing. Signatures are usually implemented as two separate (one for reads and another for write...
Design and Evaluation of a Subblock Cache Coherence Protocol for Bus-Based Multiprocessors
Parallel applications exhibit a wide variety of memory reference patterns. Designing a memory architecture that serves all applications well is not easy. However, because tolerating or reducing memory latency is a priority in effective parallel processing, it is important to explore new techniques to reduce memory traffic. In this paper, we describe a snoopy cache coherence protocol that uses a la...
False Sharing and Spatial Locality in Multiprocessor Caches
The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors....
False Sharing and Spatial Locality in Multiprocessor Caches
The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors...
Journal:
- IEEE Trans. Parallel Distrib. Syst.
Volume 14, issue
Pages -
Publication date: 2003